ABSTRACT

One approach to handling incomplete data occasionally encountered in the
literature is to treat the missing data as parameters and to maximize the
complete data likelihood over missing data and parameters. This paper points
out that although this approach can be useful in particular problems, it is
not a generally reliable approach to the analysis of incomplete data. In
particular, it does not share the optimal properties of maximum likelihood
estimation, except under the trivial asymptotics in which the proportion of
missing data goes to zero as the sample size increases.

ON JOINTLY ESTIMATING PARAMETERS AND MISSING DATA
BY MAXIMIZING THE COMPLETE-DATA LIKELIHOOD

Roderick J. A. Little and Donald B. Rubin


1. Introduction

In the standard formulation of maximum likelihood theory for complete
data, the data $Z$ are assumed to have a distribution with density
$f(z \mid \theta)$ indexed by an unknown parameter $\theta$. Having observed
data values $Z = z$, the likelihood of $\theta$ is the density of the observed
data regarded as a function of $\theta$, that is

    $L(\theta \mid z) = f(z \mid \theta)$ for all $\theta$.    (1)

The maximum likelihood estimate $\hat\theta$ of $\theta$ is obtained by
maximizing (1) with respect to $\theta$. We use the term complete data
likelihood to refer to the expression (1).

Now suppose that some of the values in $z$ are not observed. Let $Z_m$
denote the missing components and $Z_p$ the observed (present) components,
where $z_p$ is the observed value of $Z_p$. It is not uncommon in the
literature on incomplete data to see the suggestion that estimates of
$\theta$ can be found by treating the missing values $z_m$ as parameters and
maximizing the complete data likelihood with respect to $\theta$ and $z_m$.
In symbols, this corresponds to maximizing the function

    $L_1(\theta, z_m \mid z_p) = f(z_p, z_m \mid \theta)$    (2)

with respect to $(\theta, z_m)$. The classic example of this approach is in
the analysis of missing plots in analysis of variance, where missing outcomes
are treated as parameters and then filled in to allow computationally
efficient methods to be used for analysis (Anderson, 1946; Bartlett, 1937;
Rubin, 1972). More recently, DeGroot and Goel (1980) propose this approach as
one possibility for the analysis of a mixed-up bivariate normal sample, where
the missing data are the indices that allow the values of the two variables to
be paired, and a priori all pairings are assumed equally likely. Press and
Scott (1976) present a Bayesian analysis of an incomplete multivariate normal
sample which is formally equivalent to maximizing (2). They maximize the
joint posterior distribution of $\theta$ and $z_m$, after specifying a flat
prior distribution for the parameter $\theta$.

Although the literature on missing plot analysis explicitly recognizes
the problems resulting from the suggested procedure, the more recent
literature can be read as implying that maximizing (2) over missing data and
parameters is just as principled as standard maximum likelihood estimation
from the complete data. Our purpose is simply to point out that joint
maximization over missing data and parameters is not a maximum likelihood
procedure in any useful sense of the term. It does not in general enjoy the
optimal large sample properties of maximum likelihood estimation, except using
the trivial asymptotics in which the fraction of the data which are missing
goes to zero as the sample size increases.

From the likelihood perspective, missing data $z_m$ differ fundamentally
from parameters $\theta$ in that they are random variables with an a priori
specified probability distribution. The correct likelihood is obtained by
integrating the missing data $z_m$ out of the complete data likelihood (1),
that is, the correct likelihood is

    $L_2(\theta \mid z_p) = \int f(z_p, z_m \mid \theta) \, dz_m$ for all $\theta$.    (3)

This formulation implicitly assumes that the missing data are missing at
random [...] which are observed. If the missing data are not missing at
random, then the model formulation needs to include a distribution for the
set of variables indicating whether values are observed or missing. For
details, see Rubin (1976).

Assuming the missing data are missing at random, $L_2$ given by (3) is
equal to the probability density of the observed data $z_p$ regarded as a
function of the unknown parameter, that is, of quantities not having a
probability distribution. Hence $L_2$ and not $L_1$ is the true likelihood of
$\theta$ given incomplete data $z_p$. In the next section we compare parameter
estimates of $\theta$ found by maximizing $L_1$ with maximum likelihood
estimates found by maximizing $L_2$ for some simple problems.

2.  Examples 

Example 1. Univariate Normal Sample

Suppose that $z$ consists of $N$ observations from a Normal distribution
with mean $\mu$ and variance $\sigma^2$, $z_p$ consists of $n$ observations
which are observed and $z_m$ represents $N-n$ missing observations which are
assumed missing at random. Let $\bar z$ and $s^2$ denote the sample mean and
sample variance (with denominator $n$) of the $n$ observed values. Then
$\theta = (\mu, \sigma^2)$, and maximizing $L_2$ leads to maximum likelihood
estimates $\hat\mu = \bar z$, $\hat\sigma^2 = s^2$. In contrast, maximizing
$L_1$ with respect to $\theta$ and $z_m$ yields a common estimate $\bar z$ for
all components of $z_m$, and estimates $\tilde\mu = \bar z$,
$\tilde\sigma^2 = s^2 (n/N)$. Thus the maximum likelihood estimate of the mean
is obtained, but the maximum likelihood estimate of the variance is multiplied
by the fraction of observed data. When the fraction of missing data is
substantial (for example, $n/N = 0.6$), the estimated variance
$\tilde\sigma^2$ is badly biased, and this bias does not vanish as
$N \to \infty$ unless $n/N \to 1$; more relevant asymptotics would fix $n/N$
as the sample size increases.
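
The variance shrinkage in Example 1 can be checked numerically. The following is a minimal sketch, not part of the original note: it alternately maximizes the complete-data normal likelihood over the missing values and over the mean, then compares the result with the observed-data estimates. The function name `joint_maximization_normal`, the simulated data, and the choice of n = 600 observed out of N = 1000 are illustrative assumptions.

```python
import random

def joint_maximization_normal(observed, n_missing, iterations=200):
    """Alternately maximize the complete-data normal likelihood over the
    missing values and over the mean (hypothetical helper for illustration)."""
    n = len(observed)
    N = n + n_missing
    mu = 0.0  # arbitrary starting value
    for _ in range(iterations):
        # Given mu, each missing value maximizes the normal density at z = mu.
        filled = [mu] * n_missing
        # Given the filled-in values, mu is the mean of the completed sample.
        mu = (sum(observed) + sum(filled)) / N
    # Filled-in values sit at mu and add nothing to the sum of squares,
    # but the divisor is still the full sample size N.
    sigma2 = sum((z - mu) ** 2 for z in observed) / N
    return mu, sigma2

random.seed(0)
observed = [random.gauss(10.0, 2.0) for _ in range(600)]  # n = 600 (assumed)
n_missing = 400                                           # so N = 1000

z_bar = sum(observed) / len(observed)
s2 = sum((z - z_bar) ** 2 for z in observed) / len(observed)
mu_joint, sigma2_joint = joint_maximization_normal(observed, n_missing)

print("maximizing L2:", z_bar, s2)
print("maximizing L1:", mu_joint, sigma2_joint)
```

Under these assumptions the two means agree, while the jointly maximized variance equals the observed-data variance multiplied by n/N = 0.6.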


Example 2. Missing Plot Analysis of Variance

Suppose we add to the previous example a set of covariates $x$ which is
observed for all $N$ observations. We assume that the value of $z$ for
observation $i$ with covariate values $x_i$ is Normal with mean
$\beta_0 + \beta^T x_i$ and variance $\sigma^2$. The estimates of $\beta_0$
and $\beta$ obtained by maximizing $L_1$ are the maximum likelihood estimates,
obtained by least squares regression with the $n$ observed data points.
However, as in Example 1, the estimate of variance is the maximum likelihood
estimate multiplied by the proportion of observed values.

These results provide one justification for the analysis of missing plots
in analysis of variance designs mentioned in section 1: jointly estimating
the values of the outcome variable for the missing plots and the parameters
leads to maximum likelihood estimates of the effects $\beta$. However, an
adjustment is needed to the resulting estimate of the residual variance
$\sigma^2$, as the literature on missing plot analysis explicitly recognizes.
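
A small numerical check of Example 2, offered as a sketch under stated assumptions rather than as the procedure of the missing plot literature: with a single covariate, filling in the missing outcomes at their fitted values and refitting by least squares leaves the coefficients unchanged, so the filled-in data are a fixed point of the joint maximization, while the residual variance is divided by N instead of n. The helper `least_squares` and the data values are hypothetical.

```python
def least_squares(x, z):
    """Simple linear regression of z on x (hypothetical helper).
    Returns (intercept, slope, residual variance with divisor len(x))."""
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxz = sum((a - x_bar) * (b - z_bar) for a, b in zip(x, z))
    slope = sxz / sxx
    intercept = z_bar - slope * x_bar
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, z))
    return intercept, slope, rss / n

# Made-up data: covariate known for all 8 units, outcome observed for 5.
x_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
z_obs = [2.1, 3.9, 6.2, 7.8, 10.1]
x_mis = [6.0, 7.0, 8.0]

# Least squares on the observed units (maximizes L2 for the coefficients).
b0, b1, s2 = least_squares(x_obs, z_obs)

# Fill in missing outcomes at their fitted values, then refit on all units.
z_fill = [b0 + b1 * x for x in x_mis]
c0, c1, v = least_squares(x_obs + x_mis, z_obs + z_fill)

print("coefficients:", (b0, b1), "after fill-in:", (c0, c1))
print("variance:", s2, "after fill-in:", v)
```

The refitted coefficients match the observed-data fit, and the refitted variance is the observed-data variance times 5/8, the proportion of observed values.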

Example 3. An Exponential Sample

In the first two examples estimation based on maximizing $L_1$ at least
yields reasonable estimates of location, even though estimates of the scale
parameter need adjustment. However, in other examples, estimates of location
can also be biased. For example, consider a censored sample from an
exponential distribution with mean $\mu$, where $z_p$ represents the $n$
observed values which lie below a known censoring point $c$, and $z_m$
represents the $N-n$ values beyond $c$ which are censored. The maximum
likelihood estimate of $\mu$ is $\hat\mu = \bar z + (N-n)c/n$. Maximization of
$L_1$ leads to estimating censored values of $z$ at the censoring point $c$,
and estimating $\mu$ by $(n/N)\hat\mu$. Thus in this case the estimate of the
mean is inconsistent unless the proportion of missing values tends to zero as
the sample size increases.
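
The inconsistency in Example 3 is easy to see in a simulation. The sketch below is illustrative only: the true mean, the censoring point, the sample size, and the seed are all assumed values chosen for the demonstration. It computes the maximum likelihood estimate and the estimate obtained by setting every censored value to c and averaging the completed sample.

```python
import random

random.seed(1)
true_mean = 2.0   # assumed value of the exponential mean
c = 2.0           # assumed censoring point
N = 100_000       # assumed total sample size

# random.expovariate takes the rate, which is 1 / mean.
sample = [random.expovariate(1.0 / true_mean) for _ in range(N)]
observed = [z for z in sample if z < c]   # values below c are observed
n = len(observed)

z_bar = sum(observed) / n

# Maximum likelihood estimate from the censored-data likelihood.
mle = z_bar + (N - n) * c / n

# Joint maximization of L1: censored values are estimated at c, and the
# mean is the average of the completed sample.
joint = (sum(observed) + (N - n) * c) / N

print("fraction observed:", n / N)
print("maximum likelihood estimate:", mle)
print("joint maximization estimate:", joint)
```

With these settings roughly 63 percent of the values are observed, so the jointly maximized estimate settles near 63 percent of the true mean however large N becomes.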
Example 4. A Bivariate Normal Sample with Missing Predictor Variables

Biased estimates of location parameters can also occur in problems
involving the normal distribution. For example, suppose that $(x_i, y_i)$,
$i = 1, \ldots, N$ are $N$ observations from a bivariate normal distribution
with mean $(\mu_x, \mu_y)$, variances $\sigma_x^2$ and $\sigma_y^2$, and
correlation $\rho$, where $y_i$ is observed for all $N$ observations, and
$x_1, \ldots, x_n$ are observed but $x_{n+1}, \ldots, x_N$ are missing at
random. Suppose that interest is focussed on the regression coefficient of
$y_i$ on $x_i$,
$\beta_{y.x} = \rho \sigma_y / \sigma_x = \beta_{x.y} \sigma_y^2 / \sigma_x^2$.
The maximum likelihood estimate of $\beta_{y.x}$ is

    $\hat\beta_{y.x} = \hat\beta_{x.y} \hat\sigma_y^2 / \hat\sigma_x^2$,

where

    $\hat\beta_{x.y} = \sum_{i=1}^{n} (x_i - \bar x) y_i / \sum_{i=1}^{n} (y_i - \bar y_n)^2$,
    $\bar x = n^{-1} \sum_{i=1}^{n} x_i$,  $\bar y_n = n^{-1} \sum_{i=1}^{n} y_i$,
    $\hat\sigma_y^2 = N^{-1} \sum_{i=1}^{N} (y_i - \bar y)^2$,  $\bar y = N^{-1} \sum_{i=1}^{N} y_i$,

and

    $\hat\sigma_x^2 = \hat\beta_{x.y}^2 \hat\sigma_y^2 + n^{-1} \sum_{i=1}^{n} [x_i - \bar x - \hat\beta_{x.y} (y_i - \bar y_n)]^2$.

Maximization of (2) over parameters and missing data yields for
$\beta_{y.x}$ the estimate

    $\tilde\beta_{y.x} = \hat\beta_{y.x} \hat\sigma_x^2 / \tilde\sigma_x^2$,

where

    $\tilde\sigma_x^2 = \hat\beta_{x.y}^2 \hat\sigma_y^2 + N^{-1} \sum_{i=1}^{n} [x_i - \bar x - \hat\beta_{x.y} (y_i - \bar y_n)]^2$.

The estimate $\tilde\beta_{y.x}$ can be badly biased, and again this bias
persists as $N \to \infty$ unless the fraction of missing observations tends
to zero.
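
To give a sense of the size of the bias in Example 4, the following sketch simulates a bivariate normal sample and computes both estimates of the regression coefficient of y on x. It is an illustration under assumptions, not a computation from the original note: the sample sizes, the correlation of 0.5, unit variances, the seed, and all variable names (including `y_bar_n` for the mean of y over the units with x observed) are chosen here for the demonstration.

```python
import math
import random

random.seed(2)
N, n, rho = 20_000, 10_000, 0.5   # assumed sizes and correlation

# Unit-variance bivariate normal pairs with correlation rho.
y = [random.gauss(0.0, 1.0) for _ in range(N)]
x = [rho * yi + math.sqrt(1 - rho ** 2) * random.gauss(0.0, 1.0) for yi in y]
x_obs, y_obs = x[:n], y[:n]       # x is treated as missing for units n+1..N

x_bar = sum(x_obs) / n
y_bar_n = sum(y_obs) / n          # mean of y over units with x observed
y_bar = sum(y) / N
var_y = sum((yi - y_bar) ** 2 for yi in y) / N

# Regression of x on y from the n complete pairs.
b_xy = (sum((xi - x_bar) * (yi - y_bar_n) for xi, yi in zip(x_obs, y_obs))
        / sum((yi - y_bar_n) ** 2 for yi in y_obs))
rss = sum((xi - x_bar - b_xy * (yi - y_bar_n)) ** 2
          for xi, yi in zip(x_obs, y_obs))

# The residual sum of squares is divided by n under maximum likelihood
# and by N under joint maximization over parameters and missing x values.
var_x_ml = b_xy ** 2 * var_y + rss / n
var_x_joint = b_xy ** 2 * var_y + rss / N

beta_ml = b_xy * var_y / var_x_ml
beta_joint = b_xy * var_y / var_x_joint

print("true coefficient:", rho)   # equals rho because both variances are 1
print("maximum likelihood:", beta_ml)
print("joint maximization:", beta_joint)
```

With half of the x values missing, the jointly maximized coefficient lands near 0.8 rather than the true 0.5, because the estimated variance of x is too small.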

This example is a special case of the problem considered by Press and
Scott (1976). They observe that for the general problem they considered their
estimates based on maximizing $L_1$ are consistent only if the fraction of
missing observations tends to zero. The correct maximum likelihood approach,
as discussed by Trawinski and Bargmann (1964), Hartley and Hocking (1971),
Orchard and Woodbury (1972), Beale and Little (1975) and Dempster, Laird and
Rubin (1977), leads to estimates which are consistent as the sample size
increases with the fraction of missing data held constant.




3. Missing Values as Parameters


Both maximum likelihood and the maximization of $L_1$ over parameters and
missing data assume the existence of a model that specifies a distribution
for the observed and missing values of $z$. Occasionally situations may arise
in which it is desirable to avoid specifying a distribution for the missing
values and to treat them as genuine unknown parameters. Hartley and Hocking
(1971, sections 4 and 5) discuss the regression of $y_i$ on $x_i$, where the
values $x_i$ correspond to fixed points in an experimental design, $y_i$ is
observed for all units $i$ and components of $x_i$ are missing for some units.
Writing $x_p$ and $x_m$ for the present and missing values of $x$,
respectively, Hartley and Hocking (1971) suggest drawing inferences by
maximizing the complete data likelihood based on the conditional distribution
of $y$ given $x$,

    $L_3(\theta, x_m \mid y, x_p) = f(y \mid x_m, x_p; \theta)$    (4)

with respect to $x_m$ and the parameters $\theta$. Hartley and Hocking discuss
analyses where values of $x_m$ are unconstrained or are constrained to be any
of $k$ alternatives. We believe that in most practical situations it is more
natural to include a distribution for the missing values in the model (Rubin,
1971). From a strict likelihood perspective, however, there is no reason in
principle to reject inferences based on (4). The question of whether $x_m$
should be treated as fixed or integrated out of the likelihood (as in (3))
relates to the more general issue of statistical inference in the presence of
nuisance parameters, which lies outside the scope of this note.